[ManagedIdentity] Detect dead KeyGuard keys and purge orphan IMDSv2 mTLS certs on reboot #6037
Closed
gladjohn wants to merge 1 commit into
…TLS certs on reboot

Fixes the post-reboot recovery path for IMDSv2 mTLS PoP token acquisition. On Azure VM restart the per-boot KeyGuard key (NCryptUsePerBootKeyFlag) is reaped by VBS, but the persisted binding cert under CN=managedidentitysnissuer.login.microsoft.com still references the old public key. The next call then either burns a failed TLS handshake before the reactive SChannel catch kicks in, or, in the zombie-handle variant, falls through entirely because the cert's modulus still matches the dead container.

Changes
-------
- Add CanSign liveness probe right after CngKey.Open in WindowsCngKeyOperations.TryGetOrCreateKeyGuard. 1-byte RSA-SHA256 PKCS1 sign; ~1-3 ms, runs once per process (result is cached in WindowsManagedIdentityKeyProvider._cachedKey). Catches zombie-VBS state where Open succeeds but private material is dead.
- Add PurgeManagedIdentityCertificates: one-shot issuer-CN substring sweep of CurrentUser\My, invoked at the moment a fresh KeyGuard key is minted (both the probe-failed path and the Open-threw path). Removes orphaned binding certs at the cause site so the next request doesn't pay any per-Read discovery cost and multi-identity hosts (SAMI + UAMIs sharing the KeyGuard container) are cleaned up uniformly.
- Add 4 Windows-only unit tests for the purge filter behavior (matching, non-matching, case-insensitive, only-removes-matching).

The reactive SChannel catch in ImdsV2ManagedIdentitySource is retained as a defensive backstop.

Validation
----------
Validated E2E on a Server 2022 KeyGuard VM across multiple reboots and mixed SAMI/UAMI cases. Canonical post-reboot first call:
- CngKey.Open threw CryptographicException HR=0x8009003A
- Fresh KeyGuard key created
- PurgeManagedIdentityCertificates removed orphan cert (Inspected=4)
- MAA attestation OK
- POST /issuecredential -> 200
- mTLS handshake -> 200 on first try (no reactive catch invoked)
- Total ~2.8s on cold start

Full unit suite green on net8.0: 2069 passed, 0 failed, 19 skipped.

Refs #6031. Complementary to #6020 (cert-side modulus comparison): this PR adds the key-side liveness probe and broad issuer-CN sweep at the mint site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
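The recovery flow described in the commit message can be sketched roughly as follows. This is a minimal, platform-neutral illustration and not the code in the PR: `KeyRecoverySketch`, `GetOrCreate`, and the `open` / `create` / `purgeOrphanCerts` delegates are hypothetical stand-ins for `CngKey.Open`, KeyGuard key creation, and `PurgeManagedIdentityCertificates`, and the probe is assumed to work on any `RSA` instance.

```csharp
using System;
using System.Security.Cryptography;

public static class KeyRecoverySketch
{
    // Hypothetical probe: sign a single byte to prove the private key is usable.
    // Only CryptographicException is treated as "dead key" in this sketch.
    public static bool CanSign(RSA key)
    {
        try
        {
            byte[] signature = key.SignData(
                new byte[] { 0x00 },
                HashAlgorithmName.SHA256,
                RSASignaturePadding.Pkcs1);
            return signature.Length > 0;
        }
        catch (CryptographicException)
        {
            return false;
        }
    }

    // open, create, and purgeOrphanCerts are assumed stand-ins, not MSAL APIs.
    public static RSA GetOrCreate(Func<RSA> open, Func<RSA> create, Action purgeOrphanCerts)
    {
        RSA existing = null;
        try
        {
            existing = open();
            if (existing != null && CanSign(existing))
            {
                return existing; // healthy key: persisted certs remain valid
            }
        }
        catch (CryptographicException)
        {
            // Open-threw path: fall through and mint a fresh key.
        }

        // Probe-failed or Open-threw: discard the old handle, mint, then purge.
        existing?.Dispose();
        RSA fresh = create();
        purgeOrphanCerts();
        return fresh;
    }
}
```

The purge is invoked only on the two branches that mint a fresh key, so a healthy key never triggers a certificate-store sweep.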
Pull request overview
This PR improves the post-reboot recovery path for IMDSv2 mTLS PoP on Windows KeyGuard VMs by proactively detecting stale per-boot KeyGuard keys and purging orphaned IMDSv2 binding certificates from the user cert store when a fresh key is minted.
Changes:
- Add an RSA signing liveness probe immediately after `CngKey.Open` to detect “zombie” per-boot KeyGuard handles and recreate the key when necessary.
- Add `PurgeManagedIdentityCertificates` to remove IMDSv2-issued binding certs from `CurrentUser\My` when the KeyGuard key is re-minted.
- Add Windows-only unit tests validating the purge issuer-filter behavior.
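As an illustration of the issuer filter that the purge and its tests revolve around, a case-insensitive substring match could look like the sketch below. The class and method names are hypothetical; only the issuer CN fragment is taken from the PR text.

```csharp
using System;

public static class IssuerFilterSketch
{
    // Issuer CN fragment quoted in the PR; everything else here is illustrative.
    private const string IssuerCnFragment = "managedidentitysnissuer.login.microsoft.com";

    // Hypothetical helper: true when the issuer string contains the fragment,
    // ignoring case. Null or empty issuers never match.
    public static bool IsManagedIdentityIssuer(string issuer)
    {
        return !string.IsNullOrEmpty(issuer) &&
               issuer.IndexOf(IssuerCnFragment, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

An ordinal, case-insensitive comparison keeps the match independent of the current culture, which seems the safer default for a filter that decides which certificates get deleted.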
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/client/Microsoft.Identity.Client/ManagedIdentity/KeyProviders/WindowsCngKeyOperations.cs | Adds KeyGuard liveness probe and best-effort certificate-store purge for IMDSv2 issuer certs. |
| tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/WindowsCngKeyOperationsPurgeUnitTests.cs | Adds Windows-only unit tests for purge matching and non-matching issuer behavior. |
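For readers unfamiliar with how such purge tests can control the issuer they match on, here is a minimal sketch of building a self-signed certificate with `CertificateRequest`. It is an assumption about the general approach, not the test code in the PR: `PurgeTestCerts` and `CreateSelfSigned` are hypothetical names, and the validity window is arbitrary.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class PurgeTestCerts
{
    // Hypothetical helper: builds a short-lived self-signed cert. For a
    // self-signed cert the Issuer equals the Subject, so the subject string
    // controls what an issuer filter will see.
    public static X509Certificate2 CreateSelfSigned(string issuerCn)
    {
        // A unique OU lets test cleanup find only the certs this run created.
        string subject = $"CN={issuerCn}, OU=MSAL-Purge-Test-{Guid.NewGuid():N}";

        using RSA rsa = RSA.Create(2048);
        var request = new CertificateRequest(
            subject, rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

        DateTimeOffset now = DateTimeOffset.UtcNow;
        return request.CreateSelfSigned(now.AddMinutes(-5), now.AddHours(1));
    }
}
```

Because issuer and subject coincide, a test can produce both matching and non-matching certificates simply by varying the CN it passes in.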
Comment on lines +439 to +457

    // Snapshot to avoid 'collection modified during enumeration' provider quirks.
    var snapshot = new X509Certificate2[store.Certificates.Count];
    try
    {
        store.Certificates.CopyTo(snapshot, 0);
    }
    catch (Exception copyEx)
    {
        logger?.Info(() =>
            $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: store snapshot via CopyTo failed " +
            $"({copyEx.GetType().Name}: {copyEx.Message}). Falling back to enumeration.");

        int i = 0;
        snapshot = new X509Certificate2[store.Certificates.Count];
        foreach (X509Certificate2 c in store.Certificates)
        {
            snapshot[i++] = c;
        }
    }
Comment on lines +474 to +487

    try
    {
        store.Remove(candidate);
        removed++;
        logger?.Info(() =>
            $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: removed cert. " +
            $"Thumbprint={thumb}, NotAfter={notAfter:O}, Issuer='{issuer}'.");
    }
    catch (Exception removeEx)
    {
        logger?.Info(() =>
            $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: failed to remove cert " +
            $"Thumbprint={thumb}. {removeEx.GetType().Name}: '{removeEx.Message}'.");
    }
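The two hunks above follow a snapshot-then-remove pattern. A self-contained sketch of that pattern over an in-memory `X509Certificate2Collection` is shown below; it is not the PR's implementation. `PurgeSketch`, `Purge`, and the `log` callback are hypothetical, and a real sweep would operate on an `X509Store` opened for write access rather than a plain collection.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography.X509Certificates;

public static class PurgeSketch
{
    // Issuer CN fragment quoted in the PR; the rest of this class is illustrative.
    private const string IssuerCnFragment = "managedidentitysnissuer.login.microsoft.com";

    // Hypothetical helper: removes matching certs and returns how many were removed.
    public static int Purge(X509Certificate2Collection certs, Action<string> log)
    {
        // Snapshot first so removal never mutates the sequence being iterated.
        X509Certificate2[] snapshot = certs.Cast<X509Certificate2>().ToArray();
        int removed = 0;

        foreach (X509Certificate2 candidate in snapshot)
        {
            bool matches = candidate.Issuer.IndexOf(
                IssuerCnFragment, StringComparison.OrdinalIgnoreCase) >= 0;
            if (!matches)
            {
                continue; // leave unrelated certs alone
            }

            try
            {
                certs.Remove(candidate);
                removed++;
            }
            catch (Exception ex)
            {
                // Best effort: one failed removal must not abort the sweep.
                log?.Invoke($"failed to remove {candidate.Thumbprint}: {ex.GetType().Name}");
            }
        }

        return removed;
    }
}
```

Taking the snapshot from a single read of the collection avoids sizing the array from one read and filling it from another, which is one way to sidestep count mismatches if the underlying store changes between reads.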
Summary
Fixes the post-reboot recovery path for IMDSv2 mTLS PoP token acquisition on Azure VMs with KeyGuard.
On VM restart the per-boot KeyGuard key (`NCryptUsePerBootKeyFlag`) is reaped by VBS, but the persisted binding cert under `CN=managedidentitysnissuer.login.microsoft.com` in `CurrentUser\My` still references the old public key. Two failure modes today:

- `CngKey.Open` throws `HR=0x8009003A` (Cannot decrypt a VBS-isolated key.): handled by minting fresh, but the orphaned cert was left behind, so a subsequent cold path could still surface it before the reactive `IsSchanelFailure` catch in `ImdsV2ManagedIdentitySource` kicks in.
- `CngKey.Open` succeeds, the container's public-key metadata survived a reboot, but the private material is dead. `ExportParameters(false)` still returns the old modulus that matches the persisted cert, so any cert-side modulus comparison can't detect this case. The SChannel handshake then fails and we fall through to the reactive catch.

Changes
`WindowsCngKeyOperations.cs`

- `CanSign` liveness probe right after `CngKey.Open` succeeds. 1-byte RSA-SHA256 PKCS1 sign; ~1–3 ms; runs once per process (the result is cached in `WindowsManagedIdentityKeyProvider._cachedKey` behind a `SemaphoreSlim`). Catches the zombie-handle variant cleanly.
- `PurgeManagedIdentityCertificates`: one-shot, issuer-CN substring sweep (`managedidentitysnissuer.login.microsoft.com`, case-insensitive) of `CurrentUser\My`, invoked at the moment a fresh KeyGuard key is minted (both probe-failed and `Open`-threw branches).

`WindowsCngKeyOperationsPurgeUnitTests.cs`

Four Windows-only unit tests for the purge filter behavior:

- `RemovesCertWithMatchingIssuer`
- `LeavesCertWithNonMatchingIssuer`
- `MatchIsCaseInsensitive`
- `OnlyRemovesMatching_LeavesOtherCertsAlone`

Tests use `CertificateRequest`-based self-signed PFX with a discriminating Subject OU (`MSAL-Purge-Test-<Guid>`) and `ImdsV2TestStoreCleaner.RemoveAllTestArtifacts()` in `[TestInitialize]`.

Retained
The reactive SChannel catch in `ImdsV2ManagedIdentitySource.AuthenticateAsync` is kept as a defensive backstop.

Why purge at the mint site
When the container was just regenerated we know every cert under that issuer is orphaned by definition — no need to inspect them individually.
- … `Read` discovery cost on cold cache after reboot.
- … `/issuecredential` POST.

Validation
Validated E2E on a Server 2022 KeyGuard VM across multiple reboots and mixed SAMI/UAMI cases. Canonical post-reboot first call:
Full unit suite green on `net8.0`: 2069 passed, 0 failed, 19 skipped.

Refs
Draft status
Opening as draft to gather feedback alongside #6020 before deciding the final shape. Happy to:
- … `Read` modulus check on top, or
- … `CanSign` probe + issuer-CN sweep into Detect orphaned KG certs via public key modulus comparison #6020, or